Optimal Top-k Document Retrieval

Authors

  • Gonzalo Navarro
  • Yakov Nekrich
Abstract

Let 𝒟 be a collection of D documents, which are strings over an alphabet of size σ, of total length n. We describe a data structure that uses linear space and reports the k most relevant documents that contain a query pattern P, which is a string of length p, in time O(p/log_σ n + k). This time is optimal in the RAM model in the general case where lg D = Θ(log n), and it involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures, such as the number of times P appears in a document (called term frequency), a fixed document importance, and the minimal distance between two occurrences of P in a document. When lg D = o(log n), we show how to reduce the space of the data structure from O(n log n) to O(n(log σ + log D + log log n)) bits, and to O(n(log σ + log D)) bits in the case of the popular term frequency measure of relevance, at the price of an additive term O(log^ε n log σ) in the query time, for any constant ε > 0. We also consider the dynamic scenario, where documents can be inserted into and deleted from the collection. We obtain linear space and query time O(p(log log n)/log_σ n + log n + k log log k), whereas insertions and deletions require O(log^(1+ε) n) time per symbol, for any constant ε > 0. Finally, we consider an extended static scenario where an extra parameter par(P, d) is defined, and the query must retrieve only documents d such that par(P, d) ∈ [τ1, τ2], where this range is specified at query time. We solve these queries using linear space and O(p/log_σ n + log n + k log^ε n) time, for any constant ε > 0. Our technique is to translate these top-k problems into multidimensional geometric search problems. As an additional bonus, we describe some improvements to those problems.
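To make the query semantics concrete, the following is a minimal brute-force sketch of top-k retrieval under the term frequency measure. It is not the data structure described in the abstract: it scans every document, so its cost grows with the total length n instead of O(p/log_σ n + k). The names `occurrences`, `term_frequency`, and `top_k`, and the tie-breaking by smaller document index, are assumptions made for illustration only.

```python
import heapq


def occurrences(pattern, text):
    """Return the start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def term_frequency(pattern, doc):
    """Relevance measure: number of times the pattern appears in the document."""
    return len(occurrences(pattern, doc))


def top_k(docs, pattern, k, relevance=term_frequency):
    """Naive baseline: score every document containing the pattern, keep the k best.

    Returns a list of (document index, score) pairs, most relevant first.
    Ties are broken by smaller document index (an arbitrary choice).
    """
    scored = [
        (relevance(pattern, doc), doc_id)
        for doc_id, doc in enumerate(docs)
        if pattern in doc
    ]
    best = heapq.nsmallest(k, scored, key=lambda s: (-s[0], s[1]))
    return [(doc_id, score) for score, doc_id in best]


if __name__ == "__main__":
    collection = ["abracadabra", "banana", "cabbage", "xyz"]
    print(top_k(collection, "a", 2))
```

The `relevance` parameter is pluggable, so another measure from the abstract (for example a fixed per-document importance) could be passed in the same way, under the same assumption that a larger score means a more relevant document.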


Similar resources

Practical Top-K Document Retrieval in Reduced Space

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k...


Space-Efficient Top-k Document Retrieval

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k...


Top-k document retrieval in optimal space

We present an index for top-k most frequent document retrieval whose space is |CSA| + o(n) + D log(n/D) + O(D) bits, and its query time is O(log k log^(2+ε) n) per reported document, where D is the number of documents, n is the sum of lengths of the documents, and |CSA| is the space of the compressed suffix array for the documents. This improves over previous results for this problem, whose space complex...


Top-K Color Queries for Document Retrieval

In this paper we describe a new efficient (in fact optimal) data structure for the top-K color problem. Each element of an array A is assigned a color c with priority p(c). For a query range [a, b] and a value K, we have to report K colors with the highest priorities among all colors that occur in A[a..b], sorted in reverse order by their priorities. We show that such queries can be answered in...
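The query defined above can be stated as a short brute-force sketch. This only illustrates what a top-K color query returns; it scans the whole range and is not the optimal structure the paper describes. The function name `top_k_colors`, the dictionary-based `priority` map, and the inclusive 0-based range are assumptions made for illustration.

```python
def top_k_colors(A, priority, a, b, K):
    """Naive scan of A[a..b] (inclusive, 0-based).

    Returns up to K distinct colors occurring in the range,
    ordered from highest to lowest priority.
    """
    distinct = set(A[a:b + 1])
    return sorted(distinct, key=lambda c: priority[c], reverse=True)[:K]


if __name__ == "__main__":
    A = ["r", "g", "b", "g", "r", "y"]
    priority = {"r": 3, "g": 1, "b": 4, "y": 2}
    print(top_k_colors(A, priority, 1, 3, 2))
```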


Top-k document retrieval in optimal time and linear space

We describe a data structure that uses O(n)-word space and reports k most relevant documents that contain a query pattern P in optimal O(|P| + k) time. Our construction supports an ample set of important relevance measures, such as the frequency of P in a document and the minimal distance between two occurrences of P in a document. We show how to reduce the space of the data structure from O(n...


Top-k Document Retrieval in External Memory

Let D be a given set of (string) documents of total length n. The top-k document retrieval problem is to index D such that when a pattern P of length p and a parameter k come as a query, the index returns those k documents which are most relevant to P. Hon et al. [22] proposed a linear space framework to solve this problem in O(p + k log k) time. This query time was improved to O(p + k) by Navarr...



Journal:
  • CoRR

Volume abs/1307.6789

Publication date: 2013